Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation
Mustahsan, Zairah, Lim, Abel, Anand, Megna, Jain, Saahil, McCann, Bryan
As large language models become components of larger agentic systems, evaluation reliability becomes critical: unreliable sub-agents introduce brittleness into downstream system behavior. Yet current evaluation practice, reporting a single accuracy number from a single run, obscures the variance underlying these results, making it impossible to distinguish genuine capability improvements from lucky sampling. We propose adopting the Intraclass Correlation Coefficient (ICC), a metric from measurement science, to characterize this variance. ICC decomposes observed variance into between-query variance (task difficulty) and within-query variance (agent inconsistency), highlighting whether reported results reflect true capability or measurement noise. We evaluate on GAIA (Levels 1-3, measuring agentic capabilities across varying reasoning complexity) and FRAMES (measuring retrieval and factuality across multiple documents). We find that ICC varies dramatically with task structure: reasoning and retrieval tasks (FRAMES) exhibit ICC=0.4955-0.7118 across models, while agentic tasks (GAIA) exhibit ICC=0.304-0.774. For sub-agent replacement decisions in agentic systems, accuracy improvements are only trustworthy if ICC also improves. We demonstrate that ICC converges by n=8-16 trials for structured tasks and n>=32 for complex reasoning, enabling practitioners to set evidence-based resampling budgets. We recommend reporting accuracy alongside ICC and within-query variance as standard practice, and propose updated Evaluation Cards capturing these metrics. By making evaluation stability visible, we aim to transform agentic benchmarking from opaque leaderboard competition to trustworthy experimental science. Our code is open-sourced at https://github.com/youdotcom-oss/stochastic-agent-evals.
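The abstract does not spell out the exact estimator, but the standard one-way random-effects ICC(1) matches the between-/within-query decomposition it describes. A minimal sketch in Python (the `icc1` helper and the simulated pass/fail data below are illustrative, not the paper's released code):

```python
import numpy as np

def icc1(scores: np.ndarray) -> float:
    """One-way random-effects ICC(1) for an (n_queries, n_trials) score matrix."""
    n, k = scores.shape
    grand_mean = scores.mean()
    query_means = scores.mean(axis=1)
    # Between-query mean square (task difficulty) and within-query mean
    # square (agent inconsistency), as in one-way ANOVA.
    ms_between = k * np.sum((query_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - query_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Illustrative data: 5 queries, 8 trials each, binary pass/fail outcomes.
rng = np.random.default_rng(0)
p = np.array([0.9, 0.8, 0.5, 0.2, 0.1])[:, None]   # per-query success rates
outcomes = rng.binomial(1, p, size=(5, 8)).astype(float)
print(f"ICC(1) = {icc1(outcomes):.3f}")
```

Values near 1 mean repeated runs of the same query agree (low agent inconsistency); values near 0 mean within-query noise dominates, so a single-run accuracy number says little about true capability.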
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
Zhong, Ming, Wang, Yuanlei, Zhang, Liuzhou, An, Arctanx, Zhang, Renrui, Liang, Hao, Lu, Ming, Shen, Ying, Zhang, Wentao
While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans, who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, this tuning not only improves performance on HVCU-Bench but also brings gains on general benchmarks (average +2.53%), with especially substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities. The project page is at https://vcu-bridge.github.io.
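As a concrete reading of "level-wise diagnostics", one can tabulate accuracy separately at each tier of the perception -> bridging -> connotation hierarchy and check for the reported decline. The record schema below is hypothetical, not HVCU-Bench's actual format:

```python
from collections import defaultdict

# Hypothetical per-item records; one entry per benchmark question.
results = [
    {"level": "perception", "correct": True},
    {"level": "bridging", "correct": True},
    {"level": "connotation", "correct": False},
]

def levelwise_accuracy(records):
    """Accuracy at each level of the perception -> bridging -> connotation hierarchy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["level"]] += 1
        hits[r["level"]] += int(r["correct"])
    return {lvl: hits[lvl] / totals[lvl]
            for lvl in ("perception", "bridging", "connotation")}

print(levelwise_accuracy(results))
```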
KForge: Program Synthesis for Diverse AI Hardware Accelerators
Sereda, Taras, John, Tom St., Bartan, Burak, Serrino, Natalie, Katti, Sachin, Asgar, Zain
GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms. We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.
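A minimal sketch of that two-pass loop, with stub functions standing in for the LLM agents and platform toolchains (none of these names are KForge's actual API):

```python
def generate_kernel(spec, example, previous=None, feedback=None) -> str:
    """Generation agent: produce (or repair) kernel source. Stubbed."""
    return f"// kernel for {spec} (revised: {feedback is not None})"

def run_and_check(kernel, spec):
    """Compile and test against reference outputs. Stubbed as always-correct."""
    return True, ""

def profile(kernel, spec):
    """Stand-in profiler report (real sources range from APIs to GUI tools)."""
    return {"occupancy": 0.62, "dram_bw_pct": 41.0}

def suggest_optimizations(profile_report):
    """Performance analysis agent: profiler data -> recommendations. Stubbed."""
    return []  # empty list means "no further optimizations found"

def synthesize(spec, reference_impl, max_iters=10):
    # Single-shot example from another platform seeds the generation agent.
    kernel = generate_kernel(spec, example=reference_impl)
    for _ in range(max_iters):
        ok, log = run_and_check(kernel, spec)
        if not ok:  # functional pass: repair from compiler/correctness feedback
            kernel = generate_kernel(spec, reference_impl,
                                     previous=kernel, feedback=log)
            continue
        recs = suggest_optimizations(profile(kernel, spec))  # optimization pass
        if not recs:
            break
        kernel = generate_kernel(spec, reference_impl,
                                 previous=kernel, feedback=recs)
    return kernel

print(synthesize("matmul_4096", reference_impl="cuda_matmul.cu"))
```

The two-pass structure mirrors the abstract's description: correctness feedback drives functional repair first, and only a passing kernel proceeds to profiler-guided optimization.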
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Yi, Zihao, Jiang, Qingxuan, Ma, Ruotian, Chen, Xingyu, Yang, Qu, Wang, Mengru, Ye, Fanghua, Shen, Ying, Tu, Zhaopeng, Li, Xiaolong, Linus
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.
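The monotonic-decline finding reduces to a simple check over per-level fidelity scores on the four-level moral alignment scale; the numbers below are illustrative, not the paper's results:

```python
from statistics import mean

# Hypothetical fidelity scores keyed by moral alignment level
# (1 = moral paragon ... 4 = pure villain); illustrative values only.
fidelity = {1: [0.82, 0.85], 2: [0.78, 0.80], 3: [0.66, 0.70], 4: [0.55, 0.60]}

means = [mean(scores) for _, scores in sorted(fidelity.items())]
monotonic_decline = all(a > b for a, b in zip(means, means[1:]))
print(means, "monotonic decline:", monotonic_decline)
```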
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
Zhou, Xiyuan, Wang, Xinlei, He, Yirui, Wu, Yang, Zou, Ruixi, Cheng, Yuheng, Xie, Yulu, Liu, Wenxuan, Zhao, Huan, Xu, Yan, Gu, Jinjin, Zhao, Junhua
Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation -- they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate a model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results reveal a clear performance gap across levels: models struggle more as tasks become harder, perform worse when problems are slightly perturbed, and fall far behind human experts on high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench.
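A hypothetical item schema mirroring this design (EngiBench's actual fields may differ) makes the three-variant evaluation concrete: scoring a model on each variant and comparing against the original isolates robustness, domain knowledge, and symbolic reasoning in turn.

```python
from dataclasses import dataclass

@dataclass
class EngiBenchItem:
    level: int               # 1 = knowledge retrieval, 2 = contextual reasoning, 3 = open-ended modeling
    subfield: str            # e.g. "power systems"
    original: str            # the original problem statement
    perturbed: str           # surface changes      -> probes robustness
    knowledge_enhanced: str  # domain facts supplied -> isolates reasoning
    math_abstraction: str    # stripped to pure math -> probes symbolic ability

def variant_gap(score_original: float, score_variant: float) -> float:
    """Drop (or gain) relative to the original phrasing of the problem."""
    return score_variant - score_original

print(variant_gap(0.71, 0.63))  # e.g. an 8-point drop under perturbation
```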
MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning
Tomilin, Tristan, Boogaard, Luka van den, Garcin, Samuel, Grooten, Bram, Fang, Meng, Du, Yali, Pechenizkiy, Mykola
Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms, with environment availability strongly impacting research. One particularly underexplored intersection is continual learning (CL) in cooperative multi-agent settings. To remedy this, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark tailored for continual multi-agent reinforcement learning (CMARL). Existing CL benchmarks run environments on the CPU, leading to computational bottlenecks and limiting the length of task sequences. MEAL leverages JAX for GPU acceleration, enabling continual learning across sequences of 100 tasks on a standard desktop PC in a few hours. We show that naively combining popular CL and MARL methods yields strong performance on simple environments, but fails to scale to more complex settings requiring sustained coordination and adaptation. Our ablation study identifies architectural and algorithmic features critical for CMARL on MEAL.
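A toy illustration of the JAX pattern MEAL relies on: `jax.vmap` batches environment steps across parallel copies and `jax.jit` compiles them into one fused GPU call, which is what makes 100-task sequences tractable on a desktop. The environment below is a stand-in, not MEAL's API:

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    """Stand-in dynamics and reward for a single toy environment."""
    new_state = state + action
    reward = -jnp.abs(new_state).sum()
    return new_state, reward

# Vectorize across a batch of parallel environments, then JIT-compile once.
batched_step = jax.jit(jax.vmap(env_step))

states = jnp.zeros((4096, 8))                        # 4096 envs, 8-dim state
actions = jnp.ones((4096, 8))
for task_id in range(100):                           # continual task sequence
    states, rewards = batched_step(states, actions)  # one fused device call per step
```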
Experimental Assessment of a Multi-Class AI/ML Architecture for Real-Time Characterization of Cyber Events in a Live Research Reactor
Dahm, Zachery, Vasili, Konstantinos, Theos, Vasileios, Gkouliaras, Konstantinos, Richards, William, Miller, True, Jowers, Brian, Chatzidakis, Stylianos
There is increased interest in applying Artificial Intelligence and Machine Learning (AI/ML) within the nuclear industry and nuclear engineering community. Effective implementation of AI/ML could offer benefits to the nuclear domain, including enhanced identification of anomalies, anticipation of system failures, and operational schedule optimization. However, limited work has been done to investigate the feasibility and applicability of AI/ML tools in a functioning nuclear reactor. Here, we go beyond the development of a single model and introduce a multi-layered AI/ML architecture that integrates both information technology and operational technology data streams to identify, characterize, and differentiate (i) among diverse cybersecurity events and (ii) between cyber events and other operational anomalies. Leveraging Purdue University's research reactor, PUR-1, we demonstrate this architecture through a representative use case that includes multiple concurrent false data injections and denial-of-service attacks of increasing complexity under realistic reactor conditions. The use case includes 14 system states (1 normal, 13 abnormal) and over 13.8 million multivariate operational and information technology data points. The study demonstrated the capability of AI/ML to distinguish between normal, abnormal, and cybersecurity-related events, even under challenging conditions such as denial-of-service attacks. Combining operational and information technology data improved classification accuracy but posed challenges related to synchronization and collection during certain cyber events. While results indicate significant promise for AI/ML in nuclear cybersecurity, the findings also highlight the need for further refinement in handling complex event differentiation and multi-class architectures.
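Schematically, the fused-stream multi-class setup looks like the following, with synthetic data and an off-the-shelf classifier; the paper's actual architecture and PUR-1 features are more elaborate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5000
ot = rng.normal(size=(n, 12))   # operational-technology features (sensor channels)
it = rng.normal(size=(n, 6))    # information-technology features (network statistics)
X = np.hstack([ot, it])         # fusing both streams, as in the study
y = rng.integers(0, 14, size=n) # 14 system states: 1 normal + 13 abnormal (random labels here)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```

With random labels the report shows chance-level performance; the point is the shape of the pipeline: one feature matrix fusing OT and IT streams, one 14-way classifier over it.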
ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical Error Correction
Yuan, Xun, Pham, Derek, Davidson, Sam, Yu, Zhou
Currently available grammatical error correction (GEC) datasets are compiled using well-formed written text, limiting the applicability of these datasets to other domains such as informal writing and dialog. In this paper, we present a novel parallel GEC dataset drawn from open-domain chatbot conversations; this dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model, resulting in a 16-point increase in model precision. This is of particular importance because precision is considered more important than recall in GEC tasks, since false positives can cause serious confusion for language learners. We also present a detailed annotation scheme which ranks errors by perceived impact on comprehensibility, making our dataset both reproducible and extensible. Experimental results show the effectiveness of our data in improving GEC model performance in a conversational scenario.
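The precision/recall trade-off discussed above reduces to set arithmetic over edit spans; the spans and scorer below are illustrative (real GEC scoring, e.g. ERRANT, also aligns the texts and classifies error types):

```python
def precision_recall(system_edits: set, gold_edits: set):
    """Edit-level precision and recall over (start, end, correction) spans."""
    tp = len(system_edits & gold_edits)
    precision = tp / len(system_edits) if system_edits else 1.0
    recall = tp / len(gold_edits) if gold_edits else 1.0
    return precision, recall

gold = {(0, 1, "Their"), (5, 6, "went")}
system = {(0, 1, "Their"), (9, 10, "an")}  # one true positive, one false positive
print(precision_recall(system, gold))       # -> (0.5, 0.5)
```

The false positive at span (9, 10) is exactly the failure mode the abstract warns about: a spurious "correction" that would mislead a language learner, which is why precision is weighted over recall.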